MSc in Biology, currently working at SYKE, ca. 5 years of experience. Main interests: sustainability at the macro scale, national EE-IO analysis, food system sustainability modeling, and the circular economy.
In my projects at SYKE we’re using R in various capacities on a daily basis. I expect to raise my R skill level, and maybe find new tools and packages to use in my work. That will probably be easier now that there is some structure to the learning.
Our team also just adopted Git, so it’s nice to get a bit deeper into that too.
Additionally, the statistical component will most likely benefit me as well. I wish to expand my repertoire of analytical capabilities overall. At this point I’m not looking to learn any single particular analysis type (since it’s hard to tell what will be useful in the coming years), so it’s good to have a solid reference base.
Initial impressions suggest this would’ve been good learning material when I first started dealing with R. The basics are made simple to understand, but the scope seems to go quite a bit deeper than that. There seems to be plenty to learn, and I’m tempted to treat this as reference material in the future too (or as a tutorial for colleagues, if I’m allowed to share it outside the course?)
## [1] "Chapter complete on: Tue Nov 21 16:39:40 2023"
Describe the work you have done this week and summarize your learning.
# For each package: check whether it is installed, install it if not, then attach it
if (!require(here)) {
  install.packages("here")
}
library(here)
if (!require(tidyverse)) {
  install.packages("tidyverse")
}
library(tidyverse)
if (!require(GGally)) {
  install.packages("GGally")
}
library(GGally)
ggpairs(Data, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
x<-lm(Points ~ deep+Attitude+Age, data = Data) #Create model, assign a name "x" to it
summary(x)
##
## Call:
## lm(formula = Points ~ deep + Attitude + Age, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.206 -3.430 0.204 3.979 10.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.60773 3.38966 4.605 8.32e-06 ***
## deep -0.60275 0.75031 -0.803 0.423
## Attitude 0.35941 0.05696 6.310 2.56e-09 ***
## Age -0.07716 0.05322 -1.450 0.149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.307 on 162 degrees of freedom
## Multiple R-squared: 0.2043, Adjusted R-squared: 0.1896
## F-statistic: 13.87 on 3 and 162 DF, p-value: 4.305e-08
z <- lm(Points ~ Attitude, data = Data) # Create a simpler model, assign the name "z" to it
summary(z)
##
## Call:
## lm(formula = Points ~ Attitude, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
plot(z, which=c(1,2,5))
source(here("Scripts/create_data_assignment3.R"))
taulukko<-alc %>% group_by(high_use) %>% summarise(`Average absence` = mean(absences),
`St.Dev, absences.`=sd(absences),
`Average health` = mean(health),
`St.Dev., health`=sd(health),
`Average failure` = mean(failures),
`St.Dev., failures`=sd(failures),
`Average family relations` = mean(famrel),
`St.Dev., family relations` = sd(famrel))
library(gt) # gt is needed for the formatted table below
gt(taulukko) %>%
  fmt_number(decimals = 2) %>%
  cols_align("center") %>%
  opt_stylize(style = 1, color = "green") %>%
  tab_header("Summary table")
**Summary table**

| high_use | Average absence | St.Dev, absences. | Average health | St.Dev., health | Average failure | St.Dev., failures | Average family relations | St.Dev., family relations |
|---|---|---|---|---|---|---|---|---|
| FALSE | 3.71 | 4.46 | 3.49 | 1.41 | 0.12 | 0.44 | 4.01 | 0.89 |
| TRUE | 6.38 | 7.06 | 3.73 | 1.40 | 0.35 | 0.75 | 3.77 | 0.93 |
ggplot(alc, aes(x = high_use, y = absences)) + geom_boxplot() + ggtitle("Distribution of `absences` values by group")
ggplot(alc, aes(x = high_use, y = health)) + geom_boxplot()+ggtitle("Distribution of `health` values by group")
**Frequency of `failures` values by group**

| high_use | failures | n |
|---|---|---|
| FALSE | 0 | 238 |
| FALSE | 1 | 12 |
| FALSE | 2 | 8 |
| FALSE | 3 | 1 |
| TRUE | 0 | 87 |
| TRUE | 1 | 12 |
| TRUE | 2 | 9 |
| TRUE | 3 | 3 |
# find the model with glm()
m <- glm(high_use ~ failures+absences+health+famrel, data = alc, family = "binomial")
# create model summary and extract the coeffs
summary(m)
coeffs <- m %>% summary() %>% coefficients()
OddsRatio <- exp(coeffs) # NB: only the Estimate column is meaningful as odds ratios;
                         # exponentiating the Std. Error, z and p columns has no interpretation
ConfInt <- confint(m)    # these intervals are still on the log-odds scale
Result <- cbind(OddsRatio, ConfInt)
print(Result)
## Estimate Std. Error z value Pr(>|z|) 2.5 % 97.5 %
## (Intercept) 0.4634471 1.790547 0.2670735 1.205335 -1.93186722 0.36139875
## failures 1.8054371 1.219527 19.6267894 1.002916 0.20659753 0.99228511
## absences 1.0848599 1.022825 36.9319701 1.000307 0.03933340 0.12816947
## health 1.1387751 1.092594 4.3383139 1.152858 -0.04154535 0.30645324
## famrel 0.7598363 1.137448 0.1185288 1.033507 -0.52884754 -0.02203822
m <- glm(high_use ~ failures+absences+famrel, data = alc, family = "binomial")
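As an aside: in the table above only the Estimate column is interpretable as odds ratios, while the confidence intervals are still on the log-odds scale. The usual recipe is to exponentiate the coefficients and the confidence limits together, so everything lands on the odds-ratio scale. A minimal sketch on the built-in `mtcars` data (a toy model, not our `alc` model):

```r
# Toy logistic model on the built-in mtcars data (NOT the alc data of this
# assignment): does mpg predict a manual transmission (am)?
m_toy <- glm(am ~ mpg, data = mtcars, family = "binomial")

# Exponentiate the coefficients AND the confidence limits together, so that
# everything in the table is on the odds-ratio scale. confint.default() gives
# Wald-type intervals; plain confint() would give profile-likelihood ones.
or_table <- exp(cbind(OR = coef(m_toy), confint.default(m_toy)))
print(or_table)
```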
Here, we take the model from before and use it to calculate, for each student (row in the data), the probability of having the attribute “heavy drinker”.
Then, we use these probabilities to make a prediction of high_use.
The model takes the actual predictor values from the data and, based on them, estimates how likely it is that a given row (student) has the attribute. These predictions may or may not match the actual value on each row.
alc$predicted_probabilities <- predict(m, type = "response") #type determines which outcome is given, see ?predict
alc <- mutate(alc,prediction = predicted_probabilities > 0.5) #Selects those, for which the model indicates should have more than 50% prob of being a heavy drinker. T if over 50, F if not.
x <- table(high_use = alc$high_use, prediction = alc$prediction) # cross-tabulate actual vs. predicted
print(x)
## prediction
## high_use FALSE TRUE
## FALSE 242 17
## TRUE 92 19
y <- prop.table(x) * 100 # the same table as percentages of all students
print(y)
## prediction
## high_use FALSE TRUE
## FALSE 65.405405 4.594595
## TRUE 24.864865 5.135135
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
Task: find how often a constant guess goes wrong.
The function takes a vector of actual classes and a probability (constant or per row).
Say we set the probability of being a heavy drinker to 0 for every student:
For a student with a class of 0 (light drinker), the function subtracts the probability from the class value: abs(0 − 0) = 0, which is not greater than 0.5 (FALSE). This is a correct classification, as we just set the probability of being a heavy drinker to 0.
For a student with a class of 1 (heavy drinker), abs(1 − 0) = 1, which is > 0.5 (TRUE) despite the probability being 0, i.e. a misclassified case.
The converse case: we set the probability of being a heavy drinker to 1 (100 %).
If the function finds a class of 1, it computes abs(1 − 1) = 0, which is not > 0.5: a correctly classified case.
If it finds a class of 0, abs(0 − 1) = 1 as an absolute value (drop the minus), which is greater than 0.5: a misclassified case.
In each scenario, the function finally takes the mean of these TRUE/FALSE values, i.e. the proportion of misclassified cases.
Below, we first set the probability to 0 (guess everyone is a light drinker) and then to 1 (guess everyone is a heavy drinker):
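The logic above can be checked on a tiny hand-made vector (a toy example, not the alc data):

```r
# Same loss function as in the assignment
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Toy class vector: three light drinkers (0) and one heavy drinker (1)
toy_class <- c(0, 0, 0, 1)

loss_func(toy_class, prob = 0) # only the single 1 is missed -> 0.25
loss_func(toy_class, prob = 1) # the three 0s are missed -> 0.75
```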
loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7
With prob = 0 the error rate is 0.3; with prob = 1 it is 0.7. In other words, roughly 30 % of the students are actually heavy users, so the constant guess “everyone is a light drinker” goes wrong less than half as often as the opposite guess.
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2918919
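For intuition, the idea behind cv.glm can be sketched by hand in base R: split the rows into K folds, fit on K − 1 folds, predict on the held-out fold, and average the loss. A minimal sketch on simulated data (not the alc data; the variable names `y` and `x1` are made up for illustration):

```r
set.seed(42)

# Simulated data: binary outcome y depends on a single predictor x1
n  <- 200
x1 <- rnorm(n)
y  <- rbinom(n, 1, plogis(x1))
d  <- data.frame(y, x1)

K     <- 10
folds <- sample(rep(1:K, length.out = n)) # random fold assignment

fold_errors <- sapply(1:K, function(k) {
  fit  <- glm(y ~ x1, data = d[folds != k, ], family = "binomial")
  prob <- predict(fit, newdata = d[folds == k, ], type = "response")
  mean(abs(d$y[folds == k] - prob) > 0.5) # same loss as loss_func above
})

mean(fold_errors) # average prediction error across the 10 folds
```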
Further delving into the bonus section if time allows…
Not included here, as this time it is listed as the final assignment and deals with next week’s data. It has been completed; `create_human.R` is available in the repo under Scripts.
require(tidyverse);require(here);require(MASS);require(corrplot); require(GGally)
data(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
Overall, the dimensions are 506 × 14: 506 observations of 14 variables, all of them numeric (integer or double).
Next, we look up what each of the 14 vars represents.
The dataset has official documentation.
The data contains information on housing in the Boston region, USA.
Included variables are (per the MASS documentation for `Boston`):

- `crim`: per capita crime rate by town
- `zn`: proportion of residential land zoned for lots over 25,000 sq.ft.
- `indus`: proportion of non-retail business acres per town
- `chas`: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise)
- `nox`: nitrogen oxides concentration (parts per 10 million)
- `rm`: average number of rooms per dwelling
- `age`: proportion of owner-occupied units built prior to 1940
- `dis`: weighted mean of distances to five Boston employment centres
- `rad`: index of accessibility to radial highways
- `tax`: full-value property-tax rate per $10,000
- `ptratio`: pupil-teacher ratio by town
- `black`: 1000(Bk − 0.63)², where Bk is the proportion of Black residents by town
- `lstat`: lower status of the population (percent)
- `medv`: median value of owner-occupied homes in $1000s
Next, graphical and table format summaries are generated for the data
First, summary as a table
summary(Boston)
ggpairs(Boston)
There are quite a lot of options for what to look at here. I’m going to cherry-pick some findings instead of going through every variable.
Multiple variables have skewed distributions.
Also, several variables are bimodal (more than one peak in the distribution, meaning that two ranges of values are more common than others).
The indus × nox scatterplot appears to show two distinct groupings. For most data points, both the NOX concentration and the share of industrial acreage are low (and these two have a strong, statistically significant correlation coefficient, too!)
Crime rate appears to have a statistically significant correlation with almost all of these variables. It seems to correlate positively with the proportion of industrial acreage, NOX concentrations, large residential land areas…
The distribution of “indus” (business acreage) shows bimodality: we have two peaks, indicating that a couple of values are considerably more common than others. This variable also correlates with high statistical significance with NOX concentrations, which makes sense, as the variable represents the prevalence of business acreage such as industry.
Distribution of NOX is strongly skewed towards small values.
Age skews strongly towards high values. Overall, most construction in the regions of the data was done prior to 1940.
Again, we see bimodality in property tax rates. Low and high ends of the spectrum have clear peaks.
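A crude numeric check for skew: in a right-skewed distribution, the mean exceeds the median. A sketch on simulated data (exponential draws as a stand-in for a skewed variable like `crim`; not the Boston data itself):

```r
set.seed(1)

# Exponential draws are strongly right-skewed
x_demo <- rexp(1000)

mean(x_demo) > median(x_demo) # TRUE: the mean is pulled up by the long right tail
```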
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
summary(Boston)
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- data.frame(boston_scaled, crime)
boston_scaled$crim<-NULL
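The quantile-plus-cut recipe above splits a numeric variable into four (roughly) equally sized groups, since quantile() with defaults returns the min, quartiles, and max. This can be seen on a toy vector:

```r
# Toy vector 1..100; quantile() with defaults gives min, Q1, median, Q3, max
toy_x    <- 1:100
toy_bins <- quantile(toy_x)

toy_groups <- cut(toy_x, breaks = toy_bins, include.lowest = TRUE,
                  labels = c("low", "med_low", "med_high", "high"))

table(toy_groups) # each group holds a quarter of the values
```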
**Dividing the data into a training set and a testing set**
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = nrow(boston_scaled) * 0.8)
# create train set and test set
train <- boston_scaled[ind,] #Randomly selects 80% of rows, index numbers
test <- boston_scaled[-ind,] #everything except the indices in the training set
lda.fit <- lda(crime ~ . , data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale =2)
correct_classes<-test$crime
test$crime<-NULL
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
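From a cross-tabulation like this, the share of correct predictions is the sum of the diagonal divided by the table total. A toy example with made-up counts (not the actual LDA results above):

```r
# Made-up confusion matrix: rows = correct class, columns = predicted class
tab <- matrix(c(40,  5,
                10, 45),
              nrow = 2, byrow = TRUE,
              dimnames = list(correct = c("A", "B"), predicted = c("A", "B")))

accuracy <- sum(diag(tab)) / sum(tab)
accuracy # 85 / 100 = 0.85
```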
**Clearing the environment, reloading the raw Boston data and scaling it**
data(Boston)
Boston_scaled<-as.data.frame(scale(Boston))
**Calculating distances between data points**
dist_eu <- dist(Boston_scaled)
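dist() computes Euclidean distances by default. A quick sanity check on two points forming a 3-4-5 right triangle:

```r
# Two points: (0, 0) and (3, 4); the Euclidean distance should be 5
pts <- rbind(c(0, 0), c(3, 4))

d_toy <- dist(pts) # Euclidean by default
as.numeric(d_toy)  # 5
```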
Run the K-means clustering algorithm on the scaled data. Here, we run it with 6 centers; optimization of K will follow.
# k-means clustering
km <- kmeans(Boston_scaled, centers = 6) # centers must be numeric, not a string
pairs(Boston_scaled[1:6], col=km$cluster)
Quite a confusing plot matrix… a K of 6 is arguably too many.

**Optimizing K**
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# MASS, ggplot2 and Boston dataset are available
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
km <- kmeans(Boston_scaled, centers = 2)
pairs(Boston_scaled[1:5], col = km$cluster)
pairs(Boston_scaled[6:10], col = km$cluster)
pairs(Boston_scaled[c(1,10)], col = km$cluster)
pairs(Boston_scaled[c(3,5)], col = km$cluster)
pairs(Boston_scaled[c(7,14)], col = km$cluster)
pairs(Boston_scaled[c(1,14)], col = km$cluster)
data("Boston")
Boston_scaled<-as.data.frame(scale(Boston))
km <- kmeans(Boston_scaled, centers = 5) # centers must be numeric, not a string
lda.fit <- lda(km$cluster ~ . , data = Boston_scaled)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(km$cluster)
# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale =2.4)
#More zoom
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale =20)
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
**Calculating matrix products**
lda.fit <- lda(crime ~ . , data = train)
model_predictors <- dplyr::select(train, -crime)
dim(model_predictors)
dim(lda.fit$scaling)
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
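The matrix product above projects each observation onto the LDA directions: an n × p data matrix times a p × d scaling matrix yields n × d discriminant scores. A toy dimension check (random matrices standing in for the predictors and the scaling):

```r
set.seed(7)

# Toy "data": 6 observations, 4 predictors
X <- matrix(rnorm(24), nrow = 6, ncol = 4)

# Toy "scaling": 4 predictors projected onto 3 discriminant directions
W <- matrix(rnorm(12), nrow = 4, ncol = 3)

scores <- X %*% W
dim(scores) # 6 x 3: one row of discriminant scores per observation
```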
3D Plot
Colored by the crime classes of the training data, the pink “high” (high crime) data points group on their own (with some overlap with med_high, though).
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= train$crime)
(more chapters to be added similarly as we proceed with the course!)